feat: programmatic tool calling (the `script` tool) by agjs · Pull Request #50 · agjs/tsforge

agjs · 2026-06-26T18:36:19Z

What

Adds an opt-in `script` tool (`TSFORGE_SCRIPT=1`) — Programmatic Tool Calling. The model writes ONE TypeScript program that calls tools through generated `./tsforge-tools` stubs, collapsing a mechanical multi-step tool chain into a single model turn. Only the script's stdout returns to the model; intermediate results never enter context.

Why

Attacks the one axis the correctness gate is blind to — token + latency cost. Exploration/multi-file work that today costs N model turns (read 8 files, fetch+compare packages, transform-then-write across files) becomes one turn. Inspired by hermes-agent's PTC.

How (no new powers, just ergonomics)

Each stub call POSTs `{tool,args}` to a loopback RPC server that dispatches through the existing `executeTool` chokepoint — so scope, the unified policy, the write-guard, mutation accounting, and the gate all still apply. Same trust level as `run`. Bounded by a wall-clock timeout (kill), a per-script call cap, and output condensing. RPC subset excludes scaffolds/installer/yield and `script` itself (no recursion); token-gated and serialized. Not advertised in plan mode and rejected at dispatch there.
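
The guard order on the RPC side can be sketched as a small decision function. This is an illustration under assumptions, not tsforge's actual code: `checkRpcRequest`, `RpcRequest`, `ScriptSession`, and the tool names used in examples are hypothetical. Only the rules themselves (token gate, no `script` recursion, exposable subset, per-script call cap) come from the description above.

```typescript
// Hypothetical shapes; the real RPC types are not shown in this PR.
interface RpcRequest {
  token: string;
  tool: string;
  args: Record<string, unknown>;
}

interface ScriptSession {
  token: string;                  // one-time token issued for this script run
  exposable: ReadonlySet<string>; // tools a script is allowed to call
  callCap: number;                // per-script tool-call cap
  calls: number;                  // calls accepted so far
}

interface Decision {
  ok: boolean;
  reason?: string; // set only when ok is false
}

// Decide whether a stub call may be forwarded to the dispatch chokepoint.
// Accepting a call counts it against the cap; rejected calls are not counted.
function checkRpcRequest(session: ScriptSession, req: RpcRequest): Decision {
  if (req.token !== session.token) return { ok: false, reason: "bad token" };
  if (req.tool === "script") return { ok: false, reason: "no recursion" };
  if (!session.exposable.has(req.tool)) return { ok: false, reason: "tool not exposable" };
  if (session.calls >= session.callCap) return { ok: false, reason: "call cap reached" };
  session.calls += 1;
  return { ok: true };
}
```

In this sketch an accepted call would then be handed to `executeTool`, which is where scope, policy, the write-guard, and accounting are enforced; the function above only filters what reaches it.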

Also adds `TOOL_SPECS` — one source of truth for per-tool flags (`readOnly`, `scriptExposable`); `READ_ONLY_TOOL_NAMES` + the script-exposable subset now derive from it.
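
A rough sketch of what deriving both sets from one registry could look like. The flag names (`readOnly`, `scriptExposable`) and the constant names `TOOL_SPECS` and `READ_ONLY_TOOL_NAMES` come from this PR; the entries, their flag values, the helper `namesWhere`, and the name `SCRIPT_EXPOSABLE_TOOL_NAMES` are assumptions for illustration. The only flag values grounded in the description are that `yield` and `script` are not script-exposable.

```typescript
interface ToolSpec {
  readOnly: boolean;
  scriptExposable: boolean;
}

// Illustrative entries only; the real registry lives in agent.constants.ts.
const TOOL_SPECS: Record<string, ToolSpec> = {
  read: { readOnly: true, scriptExposable: true },
  edit: { readOnly: false, scriptExposable: true },
  create: { readOnly: false, scriptExposable: true },
  yield: { readOnly: false, scriptExposable: false },
  script: { readOnly: false, scriptExposable: false },
};

// Derive a name set from a flag so no hand-kept list can drift from the registry.
function namesWhere(flag: keyof ToolSpec): Set<string> {
  const names = Object.entries(TOOL_SPECS)
    .filter(([, spec]) => spec[flag])
    .map(([toolName]) => toolName);
  return new Set(names);
}

const READ_ONLY_TOOL_NAMES = namesWhere("readOnly");
const SCRIPT_EXPOSABLE_TOOL_NAMES = namesWhere("scriptExposable");
```

Adding a tool then means adding one registry entry; both derived sets pick it up without a second edit.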

Tests

15 new (`script-tool.test.ts` + gating/accounting): stub generation, single-turn batching + call accounting, real in/out-of-scope writes through `executeTool`, plan-mode rejection, call cap, timeout kill, token/recursion/non-exposable guards, serialization, registry equivalence. Full `bun run validate` green (1616 pass).

Eval status (gating this PR)

✅ Real-model smoke (DeepSeek, `fix-regression`, script on): 100% pass, no regression — harness runs end-to-end with the tool.
⏳ Win-proving A/B (`TSFORGE_FEATURE_VARIANTS=script`, TTSR + tokens at equal pass-rate, on exploration-heavy seeds) — to run before merge. Ships only if the sweep shows a real, non-regressing cost reduction.

🤖 Generated with Claude Code

…ag registry)

Add an opt-in `script` tool (TSFORGE_SCRIPT=1) that lets the model write ONE TypeScript program calling tools through generated `./tsforge-tools` stubs, collapsing a mechanical multi-step tool chain into a single model turn. Only the script's stdout returns to the model; intermediate results never enter context.

Mechanism: doScript writes the stub module + the model's code to a temp dir, starts a loopback RPC server (Bun.serve on 127.0.0.1, one-time token), and runs `bun run script.ts`. Each stub call POSTs {tool,args} back to the server, which dispatches through the EXISTING executeTool chokepoint, so scope, the unified policy, the write-guard, mutation accounting, and the gate all still apply. The model gains ergonomics, not new powers (same trust level as `run`).

Bounded by a wall-clock timeout (kill), a per-script tool-call cap, and output condensing. The RPC subset excludes the scaffolds, the dependency installer, yield, and `script` itself (no recursion); requests are token-gated and serialized so concurrent stub calls can't interleave a mutation. Not advertised in plan mode and rejected at dispatch there, so the "no writes while planning" guarantee holds.

Also introduces TOOL_SPECS in agent.constants.ts: one source of truth for per-tool flags (readOnly, scriptExposable) that READ_ONLY_TOOL_NAMES and the script-exposable subset now derive from, replacing hand-kept sets.

Tests: 15 new (stub generation, single-turn batching + call accounting, real in/out-of-scope writes through executeTool, plan-mode rejection, call cap, timeout kill, token/recursion/non-exposable guards, serialization, registry equivalence). Full `bun run validate` green (1616 pass).

Eval validation (A/B: TSFORGE_SCRIPT off vs on, TTSR + token cost at equal-or-better gate pass-rate) to follow; the feature ships only if the sweep shows a real, non-regressing cost reduction.
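
To make the stub mechanism concrete, here is a minimal sketch of generating the stub module's source text. The PR does not show the generated module, so everything here is assumed: the function name `generateStubModule`, passing the token in an `authorization` header, the endpoint path, and tool names that are valid identifiers. What is taken from the commit message is the shape of the call: a POST of `{tool,args}` to a loopback server on 127.0.0.1.

```typescript
// Build the source text of a stub module: one exported async function per tool.
// Assumes every tool name is a valid TypeScript identifier.
function generateStubModule(tools: string[], port: number, token: string): string {
  const header = [
    `const ENDPOINT = ${JSON.stringify(`http://127.0.0.1:${port}/`)};`,
    `const TOKEN = ${JSON.stringify(token)};`,
    `async function call(tool: string, args: unknown): Promise<unknown> {`,
    `  const res = await fetch(ENDPOINT, {`,
    `    method: "POST",`,
    `    headers: { "content-type": "application/json", authorization: TOKEN },`,
    `    body: JSON.stringify({ tool, args }),`,
    `  });`,
    `  if (!res.ok) throw new Error(tool + " failed: " + res.status);`,
    `  return res.json();`,
    `}`,
  ];
  // Each stub is a thin wrapper, so the script never talks to a tool directly.
  const stubs = tools.map(
    (tool) => `export const ${tool} = (args: unknown) => call(${JSON.stringify(tool)}, args);`
  );
  return header.concat(stubs).join("\n") + "\n";
}
```

Because the stubs only forward to the server, restricting the list passed as `tools` is enough to keep excluded tools out of the generated module, while the server-side checks remain the actual enforcement point.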

gemini-code-assist

Code Review

This pull request introduces a new script tool that enables programmatic tool calling by executing a TypeScript program to batch multiple tool calls into a single turn. The feedback highlights two critical issues: first, creating the temporary directory in the system temp folder breaks module resolution for project dependencies and can leak resources on server initialization errors; second, executing multiple writes within a single script turn bypasses the write-guard and touched tracking because the current tracking mechanism only records the last written file.

…r-use

The first A/B showed `script` was neutral-to-negative on create-heavy tasks: the model reached for it on trivial independent creates (which it already batches into one turn), adding a cycle. Two changes fix and prove it:

1. Retarget the guidance: use `script` ONLY when the change to many files DEPENDS on first reading each file (a read→act loop the model otherwise splits into a read turn + an edit turn). Explicitly steer away from independent batch-creates/edits. This kills the over-use.
2. Add the `migrate` eval seed: a brownfield codemod (8 services, each edited using a tier read from its own header comment), the read-dependent shape where PTC should help.

A/B (DeepSeek, temp 0):
- migrate (read-dependent codemod), pooled n=20/variant (Fisher p≈0.008):
  - off: 60% pass, ~3.0 cyc, ~17s, stuck ~40% of runs
  - on: 95% pass, ~1.7 cyc, ~13s, stuck ~5% of runs
- simple controls (validators/fixtures/handlers): on == off cycles, equal or slightly faster, equal quality; over-use regression gone.

Given a real win on its target shape and no regression elsewhere, flip `script` to DEFAULT-ON with a `TSFORGE_NO_SCRIPT` kill switch (matching NO_LSP/NO_GIT); no opt-in flag for users to think about. It makes no network calls, so default-on keeps eval sweeps deterministic. The sweep's `script` A/B dimension now toggles `TSFORGE_NO_SCRIPT` (inverted, like `git`).

Tests updated for default-on (gating: present by default, withheld under TSFORGE_NO_SCRIPT). Full `bun run validate` green (1616 pass).
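
The read-dependent shape targeted by the retargeted guidance might look roughly like the script below. It is a sketch only: the `Tools` interface stands in for the generated `./tsforge-tools` stubs (whose real signatures are not shown in this PR), the `// tier: ...` header format is invented, and `memoryTools` exists only so the example runs without a server.

```typescript
// Stand-in for the generated stubs; a real script would import them instead.
interface Tools {
  read(args: { path: string }): Promise<string>;
  edit(args: { path: string; content: string }): Promise<void>;
}

// Parse a tier from a header comment such as "// tier: gold" (invented format).
function tierFromHeader(source: string): string | null {
  const match = /^\/\/\s*tier:\s*(\w+)/m.exec(source);
  return match ? (match[1] ?? null) : null;
}

// The read→act loop: each edit depends on what that file's own header says,
// so the reads and the edits cannot be planned as two independent batches.
async function migrate(tools: Tools, paths: string[]): Promise<string[]> {
  const changed: string[] = [];
  for (const path of paths) {
    const source = await tools.read({ path });
    const tier = tierFromHeader(source);
    if (tier === null) continue; // nothing to derive, leave the file alone
    await tools.edit({ path, content: source + `\nexport const TIER = "${tier}";\n` });
    changed.push(path);
  }
  return changed;
}

// In-memory tools for a dry run of the loop.
function memoryTools(files: Map<string, string>): Tools {
  return {
    read: async ({ path }) => files.get(path) ?? "",
    edit: async ({ path, content }) => {
      files.set(path, content);
    },
  };
}
```

A real script would finish by printing a short summary (for example the number of changed files), since only stdout returns to the model and the file contents read along the way never enter context.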

agjs · 2026-06-26T20:08:53Z

Eval: proven, now default-on

Tuned the tool after the first A/B showed over-use on trivial tasks, then re-measured (DeepSeek, temp 0).

Win — read-dependent multi-file codemod (migrate seed, pooled n=20/variant):

| Variant | Pass | Cycles | Time | Stuck |
| --- | --- | --- | --- | --- |
| script off | 60% | ~3.0 | ~17s | ~40% of runs |
| script on | 95% | ~1.7 | ~13s | ~5% of runs |

Fisher exact p≈0.008. Doing the codemod manually makes the model thrash and stall ~2 in 5 runs; a script makes it reliable, ~38% fewer cycles, ~23% faster, same quality.

No regression — simple controls (validators, fixtures, handlers, n=5 each): script on == off on cycles (1.0–1.2), equal-or-slightly-faster, equal quality. The earlier over-use (fixtures on was 2.2 cyc) is gone after the guidance retarget (now 1.2 == baseline).

Decision: flipped script to default-on with a TSFORGE_NO_SCRIPT kill switch (matching NO_LSP/NO_GIT) — no opt-in flag for users. It makes no network calls, so default-on keeps eval sweeps deterministic.
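
The default-on gating can be pictured as a single predicate. This is a sketch, not the shipped logic: the function name `isScriptAdvertised`, the mode names, and the rule for which values of `TSFORGE_NO_SCRIPT` count as "set" are assumptions. The behaviour it illustrates (present by default, withheld under `TSFORGE_NO_SCRIPT`, never advertised in plan mode) is from this PR.

```typescript
type Mode = "plan" | "act"; // hypothetical mode names

// Decide whether the `script` tool is offered to the model.
// Assumption: an unset, empty, or "0" TSFORGE_NO_SCRIPT leaves the tool on.
function isScriptAdvertised(env: Record<string, string | undefined>, mode: Mode): boolean {
  if (mode === "plan") return false; // no writes while planning
  const kill = env["TSFORGE_NO_SCRIPT"];
  return kill === undefined || kill === "" || kill === "0";
}
```

A caller would typically pass the process environment and the current mode; taking the environment as a parameter keeps the check easy to exercise in an A/B sweep that toggles the variable per variant.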

Full bun run validate green (1616 pass). New migrate seed + sweep script A/B dimension included for reproducibility.

…s (PR #50 review)

Two critical issues from Gemini review:

1. The script ran from a temp dir under the system tmpdir, so a script importing a project dependency (zod, etc.) failed module resolution, and the dir leaked if startRpcServer threw before the try block. Create the temp dir inside ctx.cwd (hidden .tsforge-script-* prefix so eslint/tsc ignore it) so Node/Bun resolution walks up to the workspace node_modules + relative imports, and move the server start inside try so the dir is always cleaned up.
2. runToolCalls tracked a single wrote.path, overwritten per edit/create event, so a script that writes N files only recorded the LAST in touched and only write-guarded that one; the rest bypassed the write-guard and change-scoped rules (test-sibling-required). Collect ALL written in-scope paths in a Set and recordTouched + write-guard each.

Tests: script resolves a workspace node_modules dep; a 3-file script records all three in touched (state.edits=3) and leaves no temp dir behind. validate green (1618 pass).
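
The second fix, tracking every written path instead of only the last, can be sketched as follows. The event shape and the names `ToolEvent`, `collectWrittenPaths`, and `guardAll` are hypothetical; the commit message only says that edit/create events carry a path and that all in-scope written paths are now collected in a Set and guarded individually.

```typescript
// Hypothetical event shape for one tool call observed during a turn.
interface ToolEvent {
  kind: "read" | "edit" | "create";
  path: string;
  inScope: boolean;
}

// Collect every in-scope path written during the turn.
// A Set deduplicates repeated writes to the same file.
function collectWrittenPaths(events: ToolEvent[]): Set<string> {
  const written = new Set<string>();
  for (const event of events) {
    const isWrite = event.kind === "edit" || event.kind === "create";
    if (isWrite && event.inScope) written.add(event.path);
  }
  return written;
}

// Run the per-file follow-up on each written path rather than only the last one.
function guardAll(written: Set<string>, guard: (path: string) => void): number {
  written.forEach((path) => guard(path));
  return written.size;
}
```

With a single overwritten variable, a script that writes three files would leave two of them unchecked; iterating the Set is what closes that gap.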

gemini-code-assist Bot reviewed Jun 26, 2026

Comment thread packages/core/src/loop/tools/script-tool.ts Outdated

Comment thread packages/core/src/loop/turn.ts

agjs added 2 commits June 26, 2026 21:07

test(eval): add create-heavy PTC seeds (fixtures, validators, handlers)

9b26f1b

agjs marked this pull request as ready for review June 26, 2026 20:08

agjs merged commit 79d64f2 into main Jun 26, 2026
8 checks passed

agjs deleted the feat/script-tool branch June 26, 2026 20:30

agjs merged 4 commits into main from feat/script-tool
